In this exercise we will look at different adaptive Metropolis
algorithms. The aim is to get an intuition for what their “adaptations”
are doing, by examining their behaviour with our simple
monod model. You are encouraged to play with each of the
adaptation parameters of each algorithm and to check how it influences
the resulting sample chain.
You can use your own code from exercise 2 or the functions defined below. And of course, you can also do the exercises with the growth model.
## load the monod model
source("../../models/models.r")
## read data
data.monod <- read.table("../../data/model_monod_stoch.csv", header=T)
## Logprior for model "monod": lognormal distribution
prior.monod.mean <- 0.5 * c(r_max=5, K=3, sigma=0.5)
prior.monod.sd <- 0.5 * prior.monod.mean
logprior.monod <- function(par, mean, sd){
    sdlog <- sqrt(log(1+sd*sd/(mean*mean)))
    meanlog <- log(mean) - sdlog*sdlog/2
    return(sum(dlnorm(par, meanlog=meanlog, sdlog=sdlog, log=TRUE)))
}
## Log-likelihood for model "monod"
loglikelihood.monod <- function(y, C, par){
    ## deterministic part:
    y.det <- model.monod(par, C) # defined in `models.r`
    ## Calculate loglikelihood assuming independence:
    return( sum(dnorm(y, mean=y.det, sd=par['sigma'], log=TRUE)) )
}
## Log-posterior for model "monod"
logposterior.monod <- function(par) {
    lp <- logprior.monod(par, prior.monod.mean, prior.monod.sd)
    if(is.finite(lp)){
        return( lp + loglikelihood.monod(data.monod$r, data.monod$C, par) )
    } else {
        return(-Inf)
    }
}
using DataFrames
import CSV
using Distributions
using ComponentArrays
## load monod model
include("../../models/models.jl");
## read data
monod_data = CSV.read("../../data/model_monod_stoch.csv", DataFrame)
# set parameters
prior_monod_mean = ComponentVector(r_max = 2.5, K=1.4, sigma=0.25);
prior_monod_sd = 0.5 .* prior_monod_mean;
## Use a lognormal distribution for all model parameters
function logprior_monod(par, m, sd)
    μ = @. log(m/sqrt(1+sd^2/m^2))
    σ = @. sqrt(log(1+sd^2/m^2))
    return sum(logpdf.(LogNormal.(μ, σ), par)) # sum because we are in log-space
end
## Log-likelihood for model "monod"
function loglikelihood_monod(par::ComponentVector, data::DataFrame)
    y_det = model_monod(data.C, par)
    return sum(logpdf.(Normal.(y_det, par.sigma), data.r))
end
## Log-posterior for model "monod"
function logposterior_monod(par::ComponentVector)
    lp = logprior_monod(par, prior_monod_mean, prior_monod_sd)
    if !isinf(lp)
        lp += loglikelihood_monod(par, monod_data)
    end
    lp
end
We try the adaptive Metropolis with delayed rejection as described by Haario, Saksman and Tamminen (2001).
What could you use as initial values?
What is a meaningful initial covariance matrix for the jump distribution?
Plot the chains and see how quick the convergence is.
Look at 2d marginal plots. What happens in these marginal plots if you don’t cut off a burn-in or only use the beginning of the chain?
It is implemented in the function modMCMC of the package FME. Note that this function expects the negative log density, which is sometimes called the energy. See the modMCMC function documentation for details.
neg.log.post <- function(par) -logposterior.monod(par)
Now call the modMCMC function and investigate the
effects of the updatecov parameter (try values of 10, 100
and 1000), which determines how often the covariance of the jump
distribution is updated, and the ntrydr parameter (try
values of 1, 3 and 10), which determines the number of permissible jump
attempts before the step is rejected. In particular, examine the first
part of the chain and see how the adaptation works.
AMDR <- modMCMC(
f = neg.log.post,
p = par.init,
jump = jump.cov,
niter = 10000,
updatecov = 10,
ntrydr = 3
)
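To get an intuition for what updatecov controls, the following sketch (in Julia) shows the covariance adaptation of Haario et al. (2001). The function name am_covariance and the regularisation constant ε are our own choices for illustration; they are not part of FME.

using Statistics, LinearAlgebra

# Sketch of the adaptive Metropolis covariance update (Haario et al., 2001):
# every `updatecov` iterations the jump covariance is replaced by the scaled
# empirical covariance of the chain so far, plus a small regularisation term
# that keeps the matrix positive definite.
function am_covariance(chain; ε = 1e-8)
    d = size(chain, 2)   # chain: one row per iteration, one column per parameter
    s_d = 2.4^2 / d      # scaling factor proposed by Haario et al. (2001)
    s_d * cov(chain) + ε * I
end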
The Julia package AdaptiveMCMC.jl provides multiple MCMC algorithms, including the adaptive Metropolis.
using ComponentArrays
using AdaptiveMCMC
using MCMCChains
using Plots
using StatsPlots
θinit = ComponentVector(r_max=..., K=..., sigma=...)
res = adaptive_rwm(θinit, logposterior_monod,
10_000;
algorithm = :am,
b=1, # length of burn-in. Set to 1 for no burn-in.
);
# convert to MCMCChains for summary and plotting
chn = Chains(res.X', labels(θinit))
plot(chn)
corner(chn)
library(FME)
library(IDPmisc)
neg.log.post <- function(par) -logposterior.monod(par)
The mean of the prior seems to be a reasonable point to start the sampler. Alternatively, we could try to find the point of the maximum posterior density with an optimizer. For the covariance matrix of the jump distribution we use the standard deviation of the prior and assume independence.
par.init <- prior.monod.mean
jump.cov <- diag(prior.monod.sd/2)
AMDR <- FME::modMCMC(f = neg.log.post,
p = par.init,
jump = jump.cov,
niter = 10000,
updatecov = 10,
ntrydr = 3
)
## number of accepted runs: 8315 out of 10000 (83.15%)
## plot chains
plot(AMDR)
## 2d marginals
pairs(AMDR$pars)
IDPmisc::ipairs(AMDR$pars)
using ComponentArrays
using AdaptiveMCMC
using MCMCChains
using Plots
using StatsPlots
θinit = ComponentVector(r_max = 2.5, K=1.4, sigma=0.25) # use prior mean
## ComponentVector{Float64}(r_max = 2.5, K = 1.4, sigma = 0.25)
res = adaptive_rwm(θinit, logposterior_monod,
10_000;
algorithm = :am,
b=1, # length of burn-in. Set to 1 for no burn-in.
);
# convert to MCMCChains for summary and plotting
chn = Chains(res.X', labels(θinit))
## Chains MCMC chain (10000×3×1 reshape(adjoint(::Matrix{Float64}), 10000, 3, 1) with eltype Float64):
##
## Iterations = 1:1:10000
## Number of chains = 1
## Samples per chain = 10000
## parameters = r_max, K, sigma
##
## Summary Statistics
## parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
## Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
##
## r_max 4.3779 0.3446 0.0149 541.8121 585.6963 1.0027 missing
## K 1.6100 0.4320 0.0209 501.1791 572.3889 1.0049 missing
## sigma 0.4715 0.0948 0.0041 565.1603 684.7064 1.0006 missing
##
## Quantiles
## parameters 2.5% 25.0% 50.0% 75.0% 97.5%
## Symbol Float64 Float64 Float64 Float64 Float64
##
## r_max 3.7914 4.1704 4.3700 4.5514 5.0537
## K 0.8977 1.3294 1.5569 1.8436 2.5686
## sigma 0.3536 0.4173 0.4599 0.5109 0.6391
plot(chn)
corner(chn)
The robust adaptive Metropolis algorithm proposed by Vihola (2012) is often a good choice. It adapts the scale and the rotation of the jump distribution's covariance matrix until a predefined acceptance rate is reached.
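The core of this adaptation fits in a few lines. The following Julia sketch shows Vihola's update of the Cholesky factor of the jump covariance; the function name ram_update and the step size η are illustrative and not taken from any package.

using LinearAlgebra

# Sketch of the robust adaptive Metropolis update (Vihola, 2012).
# `S` is the lower-triangular factor of the jump covariance (Σ = S*S'),
# `u` the standard normal vector behind the current proposal x + S*u,
# `α` the realised acceptance probability of that proposal,
# `α_target` the desired acceptance rate, and `η` a decaying step size.
function ram_update(S, u, α, α_target, η)
    M = S * (I + η * (α - α_target) * (u * u') / dot(u, u)) * S'
    cholesky(Symmetric(M)).L   # factor of the adapted covariance
end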
Did the algorithm reach the desired acceptance rate?
How is the covariance matrix after the adaptation different from the initial covariance that you provided?
The R package adaptMCMC provides the function MCMC. If the argument adapt is set to TRUE, it implements the adaptation proposed by Vihola. Again examine the effect of the adaptation settings on the chains, in particular the first part of the chain, to see how the adaptation works. Here the adaptation is determined by the parameter acc.rate. Try values between 0.1 and 1.
What is the influence on the burn-in? What happens if you use a very bad initial value?
RAM <- MCMC(
p = logposterior.monod,
n = 10000,
init = par.start,
scale = jump.cov,
adapt = TRUE,
acc.rate = 0.5
)
str(RAM)
The Julia package AdaptiveMCMC.jl provides multiple MCMC algorithms, including the robust adaptive Metropolis.
using ComponentArrays
using AdaptiveMCMC
using MCMCChains
using Plots
using StatsPlots
θinit = ComponentVector(r_max=..., K=..., sigma=...)
res = adaptive_rwm(θinit, logposterior_monod,
10_000;
algorithm = :ram,
b=1, # length of burn-in. Set to 1 for no burn-in.
);
# convert to MCMCChains for summary and plotting
chn = Chains(res.X', labels(θinit))
plot(chn)
corner(chn)
library(adaptMCMC)
library(IDPmisc)
The mean of the prior seems to be a reasonable point to start the sampler. Alternatively, we could try to find the point of the maximum posterior density with an optimizer. For the covariance matrix of the jump distribution we use the standard deviation of the prior and assume independence.
par.init <- prior.monod.mean
jump.cov <- diag(prior.monod.sd/2)
RAM <- adaptMCMC::MCMC(p = logposterior.monod,
n = 10000,
init = par.init,
scale = jump.cov,
adapt = TRUE,
acc.rate = 0.5,
showProgressBar = FALSE)
## generate 10000 samples
The acceptance rate is matched closely:
RAM$acceptance.rate
## [1] 0.495
The adapted covariance matrix shows a strong correlation between \(r_{max}\) and \(K\):
RAM$cov.jump
## [,1] [,2] [,3]
## [1,] 0.037535762 0.043590423 -0.001590077
## [2,] 0.043590423 0.067528414 -0.002160905
## [3,] -0.001590077 -0.002160905 0.002007727
cov2cor(RAM$cov.jump) # rescale as correlation matrix
## [,1] [,2] [,3]
## [1,] 1.0000000 0.8658152 -0.1831653
## [2,] 0.8658152 1.0000000 -0.1855838
## [3,] -0.1831653 -0.1855838 1.0000000
## plot chains
samp.coda <- convert.to.coda(RAM)
plot(samp.coda)
## 2d marginals
IDPmisc::ipairs(RAM$samples) # a prettier version of pairs()
using ComponentArrays
using AdaptiveMCMC
using MCMCChains
using Plots
using StatsPlots
θinit = ComponentVector(r_max = 2.5, K=1.4, sigma=0.25) # use prior mean
## ComponentVector{Float64}(r_max = 2.5, K = 1.4, sigma = 0.25)
res = adaptive_rwm(θinit, logposterior_monod,
10_000;
algorithm = :ram,
b=1, # length of burn-in. Set to 1 for no burn-in.
);
# convert to MCMCChains for summary and plotting
chn = Chains(res.X', labels(θinit))
## Chains MCMC chain (10000×3×1 reshape(adjoint(::Matrix{Float64}), 10000, 3, 1) with eltype Float64):
##
## Iterations = 1:1:10000
## Number of chains = 1
## Samples per chain = 10000
## parameters = r_max, K, sigma
##
## Summary Statistics
## parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
## Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
##
## r_max 4.3775 0.3161 0.0125 653.1199 775.9973 1.0013 missing
## K 1.6075 0.4233 0.0174 603.2609 633.9303 1.0011 missing
## sigma 0.4702 0.0729 0.0028 796.2619 684.1226 1.0019 missing
##
## Quantiles
## parameters 2.5% 25.0% 50.0% 75.0% 97.5%
## Symbol Float64 Float64 Float64 Float64 Float64
##
## r_max 3.8330 4.1634 4.3508 4.5959 5.0202
## K 0.8974 1.3051 1.5693 1.8635 2.5400
## sigma 0.3536 0.4200 0.4610 0.5140 0.6250
plot(chn)
corner(chn)
Population-based samplers do not have an explicit jump distribution. Instead, they run multiple chains (often called particles or walkers in this context) in parallel, and proposals are generated based on the positions of the other particles.
A popular algorithm of this class is the affine-invariant population sampler proposed by Goodman and Weare (2010). The algorithm is often called emcee after the Python package of the same name.
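To illustrate the idea, here is a minimal Julia sketch of one (serial) sweep of the stretch move; the function name stretch_move! and the stretch parameter a follow the notation of Goodman and Weare (2010), not any particular package.

# One serial sweep of the stretch move (Goodman & Weare, 2010).
# `walkers` is a vector of parameter vectors, `logp` the log density,
# `a` the stretch parameter (2.0 is the usual default).
function stretch_move!(walkers, logp; a = 2.0)
    n = length(walkers)
    for i in 1:n
        j = rand([k for k in 1:n if k != i])  # pick another walker
        z = ((a - 1) * rand() + 1)^2 / a      # z ~ g(z) ∝ 1/√z on [1/a, a]
        proposal = walkers[j] .+ z .* (walkers[i] .- walkers[j])
        d = length(walkers[i])
        # the acceptance probability contains the factor z^(d-1)
        if log(rand()) < (d - 1) * log(z) + logp(proposal) - logp(walkers[i])
            walkers[i] = proposal
        end
    end
    walkers
end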
Population-based samplers are implemented in the R package mcmcensemble via the function MCMCEnsemble. It provides two methods: the stretch move (method = "stretch") and differential evolution (method = "differential.evolution").
EMCEE <- MCMCEnsemble(
f = logposterior.monod,
lower.inits = par.start.lower,
upper.inits = par.start.upper,
max.iter = 10000,
n.walkers = n.walkers,
method = "stretch",
coda = FALSE
)
How do you choose par.start.lower and par.start.upper? Is a wide or a narrow range better?
What is the influence of the number of walkers?
Which method works better in this case?
We use the package KissMCMC
which provides a function emcee:
using ComponentArrays
using KissMCMC: emcee
using Plots
using StatsPlots
# number of walkers (parallel chains)
n_walkers = 10
## We need a vector of initial values, one for each walker.
## Make sure that they do not start from the same point.
θinits = [θinit .* rand(3) for _ in 1:n_walkers]
# Run sampler
samples, acceptance_rate, lp = emcee(logposterior_monod,
θinits;
niter = 10_000, # total number of density evaluations
nburnin = 0);
# This looks a bit ugly. It just converts the result into
# a `MCMCChains.Chains` object for plotting.
X = permutedims(
cat((hcat(samples[i]...) for i in 1:n_walkers)..., dims=3),
[2, 1, 3]);
chn = Chains(X, labels(θinit))
# plotting
plot(chn)
corner(chn)
How do you define the initial values? Should they be very similar or very different?
What is the influence of the number of walkers?
library(mcmcensemble)
For each walker (chain) an initial starting point must be defined. In general, it is better to choose it in a region with high density. We could use an optimizer to find the mode, but here we just use a rather wide coverage.
n.walkers <- 20
par.inits <- data.frame(r_max = runif(n.walkers, 1, 10),
                        K = runif(n.walkers, 0, 5),
                        sigma = runif(n.walkers, 0.05, 2))
EMCEE <- MCMCEnsemble(
f = logposterior.monod,
inits = par.inits,
max.iter = 10000,
n.walkers = n.walkers,
method = "stretch",
coda = TRUE
)
## Using stretch move with 20 walkers.
plot(EMCEE$samples)
Note that the more walkers (chains) we have, the shorter the individual chains become. This means we have to “pay” the burn-in for every single chain. Therefore, going to an extreme number of chains is not beneficial.
n.walkers <- 1000
par.inits <- data.frame(r_max = runif(n.walkers, 1, 10),
                        K = runif(n.walkers, 0, 5),
                        sigma = runif(n.walkers, 0.05, 2))
EMCEE <- MCMCEnsemble(
f = logposterior.monod,
inits = par.inits,
max.iter = 10000,
n.walkers = n.walkers,
method = "stretch",
coda = TRUE
)
## Using stretch move with 1000 walkers.
plot(EMCEE$samples)
using ComponentArrays
using KissMCMC: emcee
using Plots
using StatsPlots
# number of walkers (parallel chains)
n_walkers = 10;
## We need a vector of initial values, one for each walker.
θinit = ComponentVector(r_max = 2.5, K=1.4, sigma=0.25); # prior mean
## ComponentVector{Float64}(r_max = 2.5, K = 1.4, sigma = 0.25)
## We add some randomness to make sure that they do not start
## from the same point (small relative perturbations keep the values positive).
θinits = [θinit .* (1 .+ rand(Normal(0, 0.1), 3)) for _ in 1:n_walkers];
# Run sampler
samples, acceptance_rate, lp = emcee(logposterior_monod,
θinits;
niter = 10_000, # total number of density evaluations
nburnin = 0);
# Converting into `MCMCChains.Chains` object for plotting.
X = permutedims(
cat((hcat(samples[i]...) for i in 1:n_walkers)..., dims=3),
[2, 1, 3]);
chn = Chains(X, labels(θinit))
## Chains MCMC chain (1000×3×10 Array{Float64, 3}):
##
## Iterations = 1:1:1000
## Number of chains = 10
## Samples per chain = 1000
## parameters = r_max, K, sigma
##
## Summary Statistics
## parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
## Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
##
## r_max 4.3156 0.5886 0.0377 185.3439 491.6939 1.0506 missing
## K 1.5914 0.4869 0.0354 188.4623 468.0097 1.0486 missing
## sigma 0.4640 0.0845 0.0055 196.9108 581.4168 1.0427 missing
##
## Quantiles
## parameters 2.5% 25.0% 50.0% 75.0% 97.5%
## Symbol Float64 Float64 Float64 Float64 Float64
##
## r_max 3.6331 4.1426 4.3595 4.5621 5.0859
## K 0.8133 1.2776 1.5511 1.8584 2.6442
## sigma 0.3422 0.4187 0.4595 0.5095 0.6294
Note that our chains have a length of only 1000 each, so a considerable share of the computation is spent on the burn-in phase.
# plotting
plot(chn)
corner(chn)
# removing burn-in:
corner(chn[250:end,:,:])
In many cases we want to make predictions for new model inputs. Let’s say we have observed the data \((y, x)\) and used it to infer the posterior \(p(\theta | y, x)\). We would now like to make predictions given a new input \(x^*\) using the learned posterior distribution:
\[ p(Y^* | x^*, y, x) = \int p(Y^* | x^*, \theta) \, p(\theta | y, x) \, \text{d}\theta \]
For the monod model, what is \(p(Y^* | x^*, \theta)\)?
Produce predictions for the monod model for \(C = \{10, 12, 14, 16\}\) using a posterior sample from the previous exercises. We only need samples from the predictive distribution. How can you do this without solving the integral analytically?
Plot the result with 90% prediction interval.
How much do you trust that the interval is correct? What assumptions did you make?
The expression \(p(Y^* | x^*, \theta)\) is our probabilistic model, so the distribution of \(Y^*\) given some inputs and parameters (i.e. the likelihood function).
Typically we do not have the posterior distribution \(p(\theta | y, x)\) in analytical form, but only a sample from it. Hence we cannot compute the integral over all \(\theta\) (besides, this would be difficult anyway). Instead, we obtain samples from \(Y^*\) in two steps: 1) draw a parameter vector \(\theta\) from the posterior sample, and 2) simulate \(Y^*\) from \(p(Y^* | x^*, \theta)\), i.e. run the stochastic model forward with this \(\theta\).
Both steps are computationally cheap: for 1) we already have samples from the parameter inference, and 2) is simply a forward simulation of the model as we did in exercise 1. Hence, producing posterior predictions is much cheaper than inference.
(Technically, we sample from the joint distribution \(p(Y^*, \theta | x^*, y, x)\). The marginalization over \(\theta\) is done by simply ignoring the sampled \(\theta\)s and looking only at \(Y^*\).)
It is important to keep in mind that the resulting prediction intervals are only correct as long as the underlying model is correct! For example, it seems rather risky to extrapolate with the monod model to large concentrations outside of the calibration data range.
We take the function for forward simulations from exercise 1:
simulate.monod.stoch <- function(par, C){
    ## run model
    r.det <- model.monod(par, C)
    ## generate noise
    z <- rnorm(length(C), 0, par["sigma"])
    return(r.det + z)
}
m <- 1000 # number of samples
Cstar <- c(10, 12, 14, 16) # new inputs
## posterior samples, removing burn-in
post.samples <- RAM$samples[1000:10000,]
Ystar <- matrix(NA, ncol=length(Cstar), nrow=m)
colnames(Ystar) <- paste0("C_", Cstar)
for(k in 1:m){
    ## 1) take a sample from the posterior
    i <- sample(nrow(post.samples), 1)
    theta <- post.samples[i,]
    ## 2) forward simulation from the model
    Ystar[k,] <- simulate.monod.stoch(theta, Cstar)
}
We can also plot the predictions with uncertainty bands:
Ystar.quants <- apply(Ystar, MARGIN=2, FUN=quantile, probs=c(0.05, 0.5, 0.95))
## plot result
plot(Cstar, Ystar.quants[2,], ylab="r", ylim=c(0, 5))
polygon(c(Cstar,rev(Cstar)), c(Ystar.quants[1,],rev(Ystar.quants[3,])), col = "grey85")
lines(Cstar, Ystar.quants[2,], col=2, lwd=2, type="b")
We take the function for forward simulations from exercise 1:
# function to simulate stochastic realisations
function simulate_monod_stoch(C, par)
    Ydet = model_monod(C, par)
    z = rand(Normal(0, par.sigma), length(Ydet)) # adding noise
    Ydet .+ z
end
## simulate_monod_stoch (generic function with 1 method)
m = 1000
## 1000
Cstar = [10,12,14,16]
## 4-element Vector{Int64}:
## 10
## 12
## 14
## 16
Ystar = Matrix{Float64}(undef, m, length(Cstar));
θ = copy(θinit);
for k in 1:m
    i = rand(1000:10000)
    θ .= res.X[:,i]
    Ystar[k,:] = simulate_monod_stoch(Cstar, θ)
end
# compute quantile
low_quantile = [quantile(Ystar[:,i], 0.05) for i in 1:length(Cstar)];
med_quantile = [quantile(Ystar[:,i], 0.5) for i in 1:length(Cstar)];
upper_quantile = [quantile(Ystar[:,i], 0.95) for i in 1:length(Cstar)];
plot(Cstar, upper_quantile,
fillrange = low_quantile,
labels = false,
xlabel = "C",
ylabel = "r",
ylim=(0,5));
plot!(Cstar, med_quantile, marker=:circle,
labels = false)
If we are able to compute the gradient of the log density, \(\nabla \log p\), we can use much more efficient sampling methods. For small numbers of parameters this may not be relevant, but for larger problems (more than 20 dimensions) the differences can be huge.
Julia is particularly well suited for these applications because many libraries for Automatic Differentiation (AD) are available: methods that compute the gradient of (almost) any Julia function by analyzing the code.
Because there is no equally powerful AD available in R, we can do this exercise only with Julia.
Most gradient-based samplers assume that all parameters are in \(\mathbb{R}\). If we have a parameter that is only defined on an interval, such as a standard deviation that is never negative, we need to transform the model parameters before sampling. For this we need three ingredients (see the sketch after this list):

- a function that maps every vector in \(\mathbb{R}^n\) to our “normal” model parameter space,
- the inverse of this function,
- the determinant of the Jacobian of this function.
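For a single positive parameter, these three ingredients can be written by hand. A minimal sketch (the names T, T_inv, logjac, and logdensity_Rn are ours, chosen for illustration):

# Map from ℝ to the positive parameter space, its inverse, and the log
# determinant of its Jacobian. Since T(x) = exp(x), the Jacobian is exp(x)
# and its log is simply x.
T(x)      = exp(x)   # ingredient 1: ℝ → (0, ∞)
T_inv(y)  = log(y)   # ingredient 2: (0, ∞) → ℝ
logjac(x) = x        # ingredient 3: log |det J| of T at x

# A log density `logp` defined on (0, ∞) becomes, after the change of
# variables, the following log density on ℝ:
logdensity_Rn(logp, x) = logp(T(x)) + logjac(x)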
The package TransformVariables helps us with these transformations. We need a function that takes a vector in \(\mathbb{R}^n\) to evaluate the posterior.
using TransformVariables
# defines the 'legal' parameter space: no parameter can be
# negative due to the lognormal priors
trans = as((r_max = asℝ₊, K = asℝ₊, sigma = asℝ₊))
# define a function that takes a parameter vector in ℝ^n
function logposterior_monod_Rn(par_Rn)
    # transform from ℝ^n to the parameter space and compute
    # the log of the determinant of the Jacobian
    par, logjac = TransformVariables.transform_and_logjac(trans, par_Rn)
    # do not forget to add the log determinant!
    logposterior_monod(ComponentVector(par)) + logjac
end
We can now sample in \(\mathbb{R}^n\) and later transform the samples to the model parameter space with:
TransformVariables.transform(trans, [-1,-1,-1]) # -> (r_max=0.367, K=0.367, sigma=0.367)
The last ingredient we need is a function that computes the gradient of logposterior_monod_Rn. To do so we need to differentiate through our model, the likelihood, the prior, the parameter transformation, and the determinant of the Jacobian. Needless to say, even for our very simple model this would be very tedious to do manually!
Instead we use Automatic Differentiation (AD). Julia has multiple packages for AD that make different trade-offs. We use ForwardDiff, which is well suited for smaller dimensions.
AD requires all your model code to be implemented in pure Julia! Otherwise there are few restrictions. For example, you can compute the gradient of the growth model even though it uses an advanced adaptive ODE solver internally.
import ForwardDiff
# derive a function that computes the gradient
∇logposterior_monod_Rn(par_Rn) = ForwardDiff.gradient(logposterior_monod_Rn, par_Rn)
# make sure that you can run both functions for all values in ℝ^n
par_test = randn(3)
logposterior_monod_Rn(par_test)
∇logposterior_monod_Rn(par_test)
Hamiltonian Monte Carlo (HMC) is one of the most powerful methods to sample in high dimensions. The “No-U-Turn Sampler” (NUTS) is a popular variant; for example, it is used in Stan.
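HMC proposes new states by simulating the Hamiltonian dynamics of the parameters together with an auxiliary momentum. At its core sits the leapfrog integrator; a minimal sketch, assuming a unit mass matrix (the function name leapfrog_step is ours):

# One leapfrog step of size ϵ for Hamiltonian dynamics with unit mass matrix.
# `∇logp` is the gradient of the log density, θ the position, p the momentum.
function leapfrog_step(θ, p, ∇logp, ϵ)
    p_half = p .+ (ϵ / 2) .* ∇logp(θ)            # half step for the momentum
    θ_new  = θ .+ ϵ .* p_half                    # full step for the position
    p_new  = p_half .+ (ϵ / 2) .* ∇logp(θ_new)   # half step for the momentum
    θ_new, p_new
end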
The package AdvancedHMC provides building blocks that can be combined to construct various HMC samplers. The function stanHMC defined below is a wrapper that provides a NUTS sampler as implemented in Stan.
using AdvancedHMC
"""
# NUTS HMC sampler as used by STAN
Based on the documentation of AdvancedHMC.jl
```Julia
stanHMC(lp::Function, ∇lp::Function, θ_init;
n_samples::Int=1000, n_adapts::Int=n_samples÷2)
```
### Arguments
- `lp`: log density to sample from (up to a constant)
- `∇lp`: define how to get the log gradient of `lp`.
- `n_samples::Int=1000`: number of samples
- `n_adapts::Int=n_samples÷2`: length of adaptation
### Return Value
A named tuple with fields:
- `samples`: array containing the samples
- `stats`: contains various statistics form the sampler. See AdvancedHMc documentation.
"""
function stanHMC(lp::Function, ∇lp::Function, θ_init;
                 n_samples::Int=1000, n_adapts::Int=n_samples÷2)
    # Define a Hamiltonian system
    D = length(θ_init) # number of parameters
    metric = DiagEuclideanMetric(D)
    hamiltonian = Hamiltonian(metric, lp, θr -> (lp(θr), ∇lp(θr)))
    # Define a leapfrog solver, with initial step size chosen heuristically
    initial_ϵ = find_good_stepsize(hamiltonian, θ_init)
    integrator = Leapfrog(initial_ϵ)
    # Define an HMC sampler, with the following components:
    # - multinomial sampling scheme,
    # - generalised No-U-Turn criteria, and
    # - windowed adaptation for step size and diagonal mass matrix
    proposal = NUTS{MultinomialTS, GeneralisedNoUTurn}(integrator)
    adaptor = StanHMCAdaptor(MassMatrixAdaptor(metric), StepSizeAdaptor(0.8, integrator))
    # -- run sampler
    samples, stats = sample(hamiltonian, proposal, θ_init, n_samples,
                            adaptor, n_adapts; progress=true)
    return (samples=samples, stats=stats)
end
HMC samplers need many density and gradient evaluations to produce a single proposal. However, the acceptance rate of an HMC sampler should be close to one. We can use the wrapper function stanHMC like this:
par_init = [-1.0, -1.0, -1.0] # in ℝⁿ
res = stanHMC(logposterior_monod_Rn,
∇logposterior_monod_Rn,
par_init;
n_samples = 100);
res.samples # these are the samples in ℝⁿ!
The samples we get are in \(\mathbb{R}^n\). Before we have a look, let’s transform them to the “normal” parameter space:
# Transform the samples to the "normal" space and convert to `Chains`
_chn = [ComponentVector(TransformVariables.transform(trans, res.samples[i]))
for i in 1:size(res.samples, 1)];
chn = Chains(_chn, [:r_max, :K, :sigma])
plot(chn)
corner(chn)
Livingstone and Zanella (2021) proposed a comparatively simple MCMC algorithm that uses the gradient to adapt the jump distribution. It is a promising alternative to HMC when the number of parameters is not very high, or when the gradient is expected to be noisy (which is often the case if a model uses an adaptive ODE solver).
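The heart of the method is the Barker proposal: a symmetric random increment whose sign is flipped coordinate-wise with a probability that depends on the gradient. A minimal sketch (the function name barker_proposal and the scale σ are ours; the Metropolis-Hastings acceptance step and the adaptation of BarkerMCMC.jl are omitted):

# One Barker proposal (Livingstone & Zanella, 2021): draw a Gaussian increment
# and flip each coordinate's sign with probability 1/(1 + exp(-zᵢ ∂ᵢ log p(x))),
# so that moves tend to follow the gradient.
function barker_proposal(x, grad_logp_x, σ)
    z = σ .* randn(length(x))
    b = [rand() < 1 / (1 + exp(-zi * gi)) ? 1.0 : -1.0
         for (zi, gi) in zip(z, grad_logp_x)]
    x .+ b .* z
end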
BarkerMCMC.jl implements it, including an adaptation that aims at a given acceptance rate.
using BarkerMCMC: barker_mcmc
par_init = [-1.0, -1.0, -1.0] # in ℝⁿ
# see `?barker_mcmc` for all options
res = barker_mcmc(logposterior_monod_Rn,
∇logposterior_monod_Rn,
par_init;
n_iter = 1000,
target_acceptance_rate=0.4);
res.samples # these are the samples in ℝⁿ!
res.log_p
The samples we get are in \(\mathbb{R}^n\). Before we have a look, let’s transform them to the “normal” parameter space:
using MCMCChains
using StatsPlots
# Transform the samples to the "normal" space and convert to `Chains`
_chn = [ComponentVector(TransformVariables.transform(trans, res.samples[i,:]))
for i in 1:size(res.samples, 1)];
chn = Chains(_chn, [:r_max, :K, :sigma])
plot(chn)
corner(chn)
First we make sure that we can sample in an unconstrained space by using appropriate transformations:
using TransformVariables
# defines the 'legal' parameter space: no parameter can be
# negative due to the lognormal priors
trans = as((r_max = asℝ₊, K = asℝ₊, sigma = asℝ₊))
## [1:3] NamedTuple of transformations
## [1:1] :r_max → asℝ₊
## [2:2] :K → asℝ₊
## [3:3] :sigma → asℝ₊
# define a function that takes a parameter vector in ℝ^n
function logposterior_monod_Rn(par_Rn)
    # transform from ℝ^n to the parameter space and compute
    # the log of the determinant of the Jacobian
    par, logjac = TransformVariables.transform_and_logjac(trans, par_Rn)
    # do not forget to add the log determinant!
    logposterior_monod(ComponentVector(par)) + logjac
end
## logposterior_monod_Rn (generic function with 1 method)
We get the gradient of the logposterior with AD:
import ForwardDiff
# derive a function that computes the gradient
∇logposterior_monod_Rn(par_Rn) = ForwardDiff.gradient(logposterior_monod_Rn, par_Rn)
## ∇logposterior_monod_Rn (generic function with 1 method)
using AdvancedHMC
function stanHMC(lp::Function, ∇lp::Function, θ_init;
                 n_samples::Int=1000, n_adapts::Int=n_samples÷2)
    # Define a Hamiltonian system
    D = length(θ_init) # number of parameters
    metric = DiagEuclideanMetric(D)
    hamiltonian = Hamiltonian(metric, lp, θr -> (lp(θr), ∇lp(θr)))
    # Define a leapfrog solver, with initial step size chosen heuristically
    initial_ϵ = find_good_stepsize(hamiltonian, θ_init)
    integrator = Leapfrog(initial_ϵ)
    # Define an HMC sampler, with the following components:
    # - multinomial sampling scheme,
    # - generalised No-U-Turn criteria, and
    # - windowed adaptation for step size and diagonal mass matrix
    proposal = NUTS{MultinomialTS, GeneralisedNoUTurn}(integrator)
    adaptor = StanHMCAdaptor(MassMatrixAdaptor(metric), StepSizeAdaptor(0.8, integrator))
    # -- run sampler
    samples, stats = sample(hamiltonian, proposal, θ_init, n_samples,
                            adaptor, n_adapts; progress=true)
    return (samples=samples, stats=stats)
end
## stanHMC (generic function with 1 method)
We use our wrapper function stanHMC. Note that for HMC we typically need far fewer samples.
par_init = [-1.0, -1.0, -1.0] # in ℝⁿ
## 3-element Vector{Float64}:
## -1.0
## -1.0
## -1.0
res = stanHMC(logposterior_monod_Rn,
∇logposterior_monod_Rn,
par_init;
n_samples = 100);
res.samples # these are the samples in ℝⁿ!
## 100-element Vector{Vector{Float64}}:
## [-0.8204851542060695, -0.9582519105125322, 0.4999387021305879]
## [-0.8204851542060695, -0.9582519105125322, 0.4999387021305879]
## [0.34690187482519796, -0.9946920475179062, 1.0225597247588047]
## [0.34331355031121924, -1.0029790142168087, 1.0591496564196854]
## [0.7870370745737327, -0.7019123242459222, 0.6379726154907976]
## [1.0058905439485335, -0.6269985470560638, 0.49533561815783267]
## [1.490869940260736, -0.4584185773810917, 0.34593040619523785]
## [1.139205681118193, -0.5381546103532856, 0.10646394078701807]
## [1.139205681118193, -0.5381546103532856, 0.10646394078701807]
## [1.3491786069325378, -0.880256918293122, -0.8609192102398182]
## ⋮
## [1.458300853430839, 0.30532437851814115, -0.6385228161023948]
## [1.4398606671389025, 0.17763103573262007, -0.6916261298971151]
## [1.3619941116192509, 0.02283684498416673, -0.6437012014472143]
## [1.3970595689397345, -0.0667970623823895, -0.7612051778995953]
## [1.5009023340978531, 0.3975354428598278, -0.745041358739944]
## [1.5655903347994187, 0.5099668960907338, -0.8729790105113338]
## [1.4225630472653825, 0.5538813780140697, -0.8742399169024534]
## [1.399596582483589, 0.3739358290903573, -0.8692538553155286]
## [1.4395438481123914, 0.3531697440586536, -0.8517308317434112]
We need to back-transform the samples to the “normal” space. For
plotting we convert to Chains:
_chn = [ComponentVector(TransformVariables.transform(trans, res.samples[i]))
for i in 1:size(res.samples, 1)];
chn = Chains(_chn, [:r_max, :K, :sigma])
## Chains MCMC chain (100×3×1 Array{Float64, 3}):
##
## Iterations = 1:1:100
## Number of chains = 1
## Samples per chain = 100
## parameters = r_max, K, sigma
##
## Summary Statistics
## parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
## Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
##
## r_max 4.1113 0.8182 0.4170 3.1940 12.7391 1.2442 missing
## K 1.4200 0.5832 0.3236 3.2752 16.3597 1.2421 missing
## sigma 0.6051 0.4234 0.1640 6.1566 12.7391 1.1370 missing
##
## Quantiles
## parameters 2.5% 25.0% 50.0% 75.0% 97.5%
## Symbol Float64 Float64 Float64 Float64 Float64
##
## r_max 1.4120 3.9499 4.2986 4.5564 4.9354
## K 0.3836 1.0855 1.4435 1.6885 2.5653
## sigma 0.3932 0.4337 0.4822 0.5439 1.7767
plot(chn)
corner(chn[25:end,:,:])
The same steps are taken for BarkerMCMC:
using BarkerMCMC: barker_mcmc
par_init = [-1.0, -1.0, -1.0]; # in ℝⁿ
## 3-element Vector{Float64}:
## -1.0
## -1.0
## -1.0
# see `?barker_mcmc` for all options
res = barker_mcmc(logposterior_monod_Rn,
∇logposterior_monod_Rn,
par_init;
n_iter = 1000,
target_acceptance_rate=0.4);
# Transform the samples to the "normal" space and convert to `Chains`
_chn = [ComponentVector(TransformVariables.transform(trans, res.samples[i,:]))
for i in 1:size(res.samples, 1)];
chn = Chains(_chn, [:r_max, :K, :sigma])
## Chains MCMC chain (1000×3×1 Array{Float64, 3}):
##
## Iterations = 1:1:1000
## Number of chains = 1
## Samples per chain = 1000
## parameters = r_max, K, sigma
##
## Summary Statistics
## parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
## Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
##
## r_max 4.2981 0.6687 0.1300 72.6285 11.5958 1.0183 missing
## K 1.5809 0.5193 0.0686 69.7629 13.1021 1.0147 missing
## sigma 0.5478 0.3692 0.1029 22.5973 12.3707 1.0457 missing
##
## Quantiles
## parameters 2.5% 25.0% 50.0% 75.0% 97.5%
## Symbol Float64 Float64 Float64 Float64 Float64
##
## r_max 1.4680 4.1692 4.3815 4.6296 5.1397
## K 0.3523 1.2934 1.5762 1.8869 2.5812
## sigma 0.3594 0.4178 0.4714 0.5333 1.9896
plot(chn)
corner(chn[250:end,:,:])